[Data] Add fillna and dropna functions #54844

soffer-anyscale · 2025-07-22T22:55:44Z

Why are these changes needed?

Null handling is a basic, common ETL requirement that other data frameworks already support. Ray Data needs the same null-handling features for feature parity and for ML dataset preprocessing.

This PR adds ds.fillna and ds.dropna functions, modeled after the equivalent pandas and PySpark functionality.

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

gemini-code-assist · 2025-07-22T22:55:48Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Copilot

Pull Request Overview

This PR adds two essential null handling functions (fillna and dropna) to Ray Data, bringing it to feature parity with pandas and PySpark for basic ETL operations and ML dataset preprocessing.

Key changes include:

- Implementation of fillna method to replace missing values with scalar or column-specific values
- Implementation of dropna method to remove rows containing missing values with flexible filtering options
- Comprehensive test suites covering edge cases and different data types
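
To make the semantics in the list above concrete, here is a minimal plain-Python sketch of "fill with a scalar or column-specific values" and "drop rows containing missing values". It is an illustration only, not the Ray Data implementation: the helper names `fillna_rows` and `dropna_rows` are hypothetical, `None` stands in for a missing value, and the `how` and `subset` parameters are borrowed from pandas' `DataFrame.dropna` on the assumption that the PR's "flexible filtering options" are similar.

```python
from typing import Any, Dict, List, Optional, Union

Row = Dict[str, Any]


def fillna_rows(rows: List[Row], value: Union[Any, Dict[str, Any]]) -> List[Row]:
    """Replace None with a scalar, or with per-column values if `value` is a dict."""
    filled = []
    for row in rows:
        new_row = {}
        for col, cell in row.items():
            if cell is None:
                if isinstance(value, dict):
                    # Columns absent from the mapping stay missing.
                    cell = value.get(col)
                else:
                    cell = value
            new_row[col] = cell
        filled.append(new_row)
    return filled


def dropna_rows(
    rows: List[Row], how: str = "any", subset: Optional[List[str]] = None
) -> List[Row]:
    """Drop rows where any (or all) of the inspected columns are None."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row)
        missing = [row.get(col) is None for col in cols]
        if how == "any":
            drop = any(missing)
        else:
            # Guard against all([]) being True when no columns are inspected.
            drop = bool(missing) and all(missing)
        if not drop:
            kept.append(row)
    return kept
```

For example, `fillna_rows(rows, {"b": "?"})` fills only column `b`, while `dropna_rows(rows, how="all")` keeps any row that has at least one non-missing value.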

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.


| File | Description |
| --- | --- |
| `python/ray/data/dataset.py` | Adds public API methods `fillna` and `dropna` with comprehensive documentation and examples |
| `python/ray/data/_internal/logical/operators/map_operator.py` | Implements logical operators `FillNa` and `DropNa` for the execution framework |
| `python/ray/data/_internal/planner/plan_udf_map_op.py` | Implements planning functions with PyArrow-based transformations for null handling |
| `python/ray/data/_internal/planner/planner.py` | Registers the new logical operators with their planning functions |
| `python/ray/data/tests/test_fillna.py` | Comprehensive test suite for fillna functionality |
| `python/ray/data/tests/test_dropna.py` | Comprehensive test suite for dropna functionality |
| `python/ray/data/BUILD` | Adds build targets for the new test files |

Comments suppressed due to low confidence (6)

python/ray/data/tests/test_fillna.py:105

[nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.

    for i, (actual, exp) in enumerate(zip(rows, expected)):

python/ray/data/tests/test_dropna.py:53

[nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.

    for i, (actual, exp) in enumerate(zip(rows, expected)):

python/ray/data/tests/test_fillna.py:192

[nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.

    for i, (actual, exp) in enumerate(zip(rows, expected)):

python/ray/data/tests/test_dropna.py:104

[nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.

    for i, (actual, exp) in enumerate(zip(rows, expected)):

python/ray/data/tests/test_fillna.py:220

[nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.

    for i, (actual, exp) in enumerate(zip(rows, expected)):

python/ray/data/tests/test_dropna.py:208

[nitpick] The variable name 'i' is not used in the loop body. Consider using '_' instead to indicate it's unused.

    for i, (actual, exp) in enumerate(zip(rows, expected)):
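
The six nitpicks above all flag the same pattern. A sketch of the fix, using a hypothetical helper `rows_match` (the actual test bodies are not shown in this conversation): either rename the index to `_` as suggested, or, since the index is never read, drop `enumerate` altogether.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def rows_match(rows: List[Row], expected: List[Row]) -> bool:
    """Compare two row lists pairwise without binding an unused index."""
    # zip() stops at the shorter input, so check lengths explicitly
    # (this check is an addition for the sketch, not part of the flagged line).
    if len(rows) != len(expected):
        return False
    # Flagged form:  for i, (actual, exp) in enumerate(zip(rows, expected)):
    # Suggested fix: for _, (actual, exp) in enumerate(zip(rows, expected)):
    # Because the index is never used, enumerate() can be removed entirely.
    for actual, exp in zip(rows, expected):
        if actual != exp:
            return False
    return True
```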


github-actions · 2025-08-08T00:42:21Z

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public Slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.


soffer-anyscale · 2025-08-08T22:15:32Z

Adding a comment to remove the stale label. Also noting larger changes to the code structure to add these as operators.


- Add advanced filling methods: forward, backward, interpolate
- Add limit parameter for consecutive fill operations
- Add ignore_values parameter for custom missing value definitions
- Add comprehensive parameter validation and error handling
- Update documentation to match Ray Data style (remove framework references)
- Add comprehensive test coverage with 35 test cases covering:
  * All new parameters and methods
  * Edge cases and error conditions
  * Multi-block behavior
  * Schema preservation
  * Performance testing

Files modified:
- dataset.py: Enhanced API with new parameters
- fillna_operator.py: Advanced logical operator with validation
- dropna_operator.py: Enhanced with ignore_values support
- plan_fillna_op.py: Physical implementation of advanced methods
- plan_dropna_op.py: Support for custom missing values
- test_fillna.py: Comprehensive test suite (18 tests)
- test_dropna.py: Comprehensive test suite (17 tests)

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
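
The commit message above names forward, backward, and interpolate fill methods plus a limit on consecutive fills, but does not show how they behave. The following plain-Python sketch illustrates one common interpretation, following pandas' `ffill`/`bfill`/`interpolate` conventions. The function names are hypothetical, `None` stands in for a missing value, and it operates on a single in-memory list, so it says nothing about how the PR handles fills across block boundaries.

```python
from typing import List, Optional

Values = List[Optional[float]]


def forward_fill(values: Values, limit: Optional[int] = None) -> Values:
    """Propagate the last seen value into gaps, at most `limit` times per gap."""
    out: Values = []
    last: Optional[float] = None  # most recent non-missing value
    run = 0  # consecutive gaps filled since `last` was seen
    for v in values:
        if v is not None:
            last, run = v, 0
            out.append(v)
        elif last is not None and (limit is None or run < limit):
            run += 1
            out.append(last)
        else:
            # Leading gaps, or gaps beyond the limit, stay missing.
            out.append(None)
    return out


def backward_fill(values: Values, limit: Optional[int] = None) -> Values:
    """Backward fill is forward fill applied to the reversed sequence."""
    return forward_fill(values[::-1], limit)[::-1]


def interpolate_linear(values: Values) -> Values:
    """Linearly interpolate interior gaps; leading and trailing gaps stay missing."""
    out = list(values)
    prev: Optional[int] = None  # index of the last non-missing value
    for i, v in enumerate(values):
        if v is None:
            continue
        if prev is not None and i - prev > 1:
            step = (v - values[prev]) / (i - prev)
            for j in range(prev + 1, i):
                out[j] = values[prev] + step * (j - prev)
        prev = i
    return out
```

With `limit=2`, `forward_fill([1, None, None, None, 4, None], limit=2)` fills only the first two gaps after `1` and leaves the third missing.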


- Fix SystemException usage in exceptions.py (remove incorrect exception chaining)
- Add missing Optional import in plan_fillna_op.py
- All files now pass lint checks and follow proper import conventions

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>


inital commit

abfc524

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

soffer-anyscale requested a review from a team as a code owner July 22, 2025 22:55

soffer-anyscale changed the title ~~inital commit~~ [Data] Add fillna and dropna functions Jul 22, 2025

soffer-anyscale requested a review from Copilot July 22, 2025 22:56

soffer-anyscale added the `data` (Ray Data-related issues) label Jul 22, 2025

Copilot AI reviewed Jul 22, 2025


soffer-anyscale added 2 commits July 22, 2025 15:58

updated test

5fb4436

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

updated op

c9dad25

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

github-actions bot added the `stale` (The issue is stale. It will be closed within 7 days unless there are further conversation) label Aug 8, 2025

soffer-anyscale added 2 commits August 8, 2025 14:29

updated operators and tests

00939db

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

refactored based on lint

41dbe65

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

soffer-anyscale and others added 3 commits August 8, 2025 16:19

Merge branch 'master' into data_nulls

0a42615

Signed-off-by: soffer-anyscale <173827098+soffer-anyscale@users.noreply.github.com>

updated to fix lint issues

f19e51f

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

Merge conflict resolution: resolved planner.py conflicts

9a9de2b

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

github-actions bot added `unstale` (A PR that has been marked unstale. It will not get marked stale again if this label is on it.) and removed `stale` (The issue is stale. It will be closed within 7 days unless there are further conversation) labels Aug 9, 2025

soffer-anyscale and others added 10 commits August 10, 2025 16:57

updated to fix test issues

999f07a

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

updated to fix unit tests

26ff5f8

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

updated fillna test

32031fc

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

Merge branch 'master' into data_nulls

6f78adf

Apply code formatting to fillna/dropna operators

8a9ec36

- Format code according to Ray's black 22.10.0 standards
- Fix formatting in logical operators and planners
- Update test file formatting

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

Fix missing Any import in plan_dropna_op.py

cf4fa0f

- Add Any to typing imports to resolve NameError when using List[Any] type annotations
- Fixes test failures caused by missing type import

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

Apply automatic code formatting to dataset.py

01b303a

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

Add linter-reformatted notebook files

214d33f

- Notebooks were reformatted by the pre-push hook linter
- No functional changes, just formatting updates

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>

soffer-anyscale requested a review from a team as a code owner September 11, 2025 18:24
